The file ‘WeatherEDA_202005_v002.Rmd’ contains exploratory data analysis for historical weather data as contained in METAR archives hosted by Iowa State University.
Data have been downloaded, processed, cleaned, and integrated for several stations (airports) and years, with .rds files saved in “./RInputFiles/ProcessedMETAR”.
This module will perform initial modeling on the processed weather files. It builds on the previous ‘WeatherModeling_202006_v001.Rmd’ and leverages functions in ‘WeatherModelingFunctions_v001.R’.
There are three main processed files available for further exploration:
metar_postEDA_20200617.rds and metar_postEDA_extra_20200627.rds
metar_modifiedClouds_20200617.rds and metar_modifiedclouds_extra_20200627.rds
metar_precipLists_20200617.rds and metar_precipLists_extra_20200627.rds
Several mapping files are defined for use in plotting; tidyverse, lubridate, and caret are loaded; and the relevant functions are sourced:
# The process frequently uses tidyverse, lubridate, caret, and randomForest
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.4
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ---------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
# The main path for the files
filePath <- "./RInputFiles/ProcessedMETAR/"
# Sourcing functions
source("./WeatherModelingFunctions_v001.R")
# Descriptive names for key variables
varMapper <- c(source="Original source file name",
locale="Descriptive name",
dtime="Date-Time (UTC)",
origMETAR="Original METAR",
year="Year",
monthint="Month",
month="Month",
day="Day of Month",
WindDir="Wind Direction (degrees)",
WindSpeed="Wind Speed (kts)",
WindGust="Wind Gust (kts)",
predomDir="General Prevailing Wind Direction",
Visibility="Visibility (SM)",
Altimeter="Altimeter (inches Hg)",
TempF="Temperature (F)",
DewF="Dew Point (F)",
modSLP="Sea-Level Pressure (hPa)",
cType1="First Cloud Layer Type",
cLevel1="First Cloud Layer Height (ft)",
isRain="Rain at Observation Time",
isSnow="Snow at Observation Time",
isThunder="Thunder at Observation Time",
tempFHi="24-hour High Temperature (F)",
tempFLo="24-hour Low Temperature (F)",
minHeight="Minimum Cloud Height (ft)",
minType="Obscuration Level at Minimum Cloud Height",
ceilingHeight="Minimum Ceiling Height (ft)",
ceilingType="Obscuration Level at Minimum Ceiling Height",
hr="Hour of Day (Zulu time)"
)
# File name to city name mapper
cityNameMapper <- c(katl_2016="Atlanta, GA (2016)",
kbos_2016="Boston, MA (2016)",
kdca_2016="Washington, DC (2016)",
kden_2016="Denver, CO (2016)",
kdfw_2016="Dallas, TX (2016)",
kdtw_2016="Detroit, MI (2016)",
kewr_2016="Newark, NJ (2016)",
kgrb_2016="Green Bay, WI (2016)",
kgrr_2016="Grand Rapids, MI (2016)",
kiah_2016="Houston, TX (2016)",
kind_2016="Indianapolis, IN (2016)",
klas_2014="Las Vegas, NV (2014)",
klas_2015="Las Vegas, NV (2015)",
klas_2016="Las Vegas, NV (2016)",
klas_2017="Las Vegas, NV (2017)",
klas_2018="Las Vegas, NV (2018)",
klas_2019="Las Vegas, NV (2019)",
klax_2016="Los Angeles, CA (2016)",
klnk_2016="Lincoln, NE (2016)",
kmia_2016="Miami, FL (2016)",
kmke_2016="Milwaukee, WI (2016)",
kmsn_2016="Madison, WI (2016)",
kmsp_2016="Minneapolis, MN (2016)",
kmsy_2014="New Orleans, LA (2014)",
kmsy_2015="New Orleans, LA (2015)",
kmsy_2016="New Orleans, LA (2016)",
kmsy_2017="New Orleans, LA (2017)",
kmsy_2018="New Orleans, LA (2018)",
kmsy_2019="New Orleans, LA (2019)",
kord_2014="Chicago, IL (2014)",
kord_2015="Chicago, IL (2015)",
kord_2016="Chicago, IL (2016)",
kord_2017="Chicago, IL (2017)",
kord_2018="Chicago, IL (2018)",
kord_2019="Chicago, IL (2019)",
kphl_2016="Philadelphia, PA (2016)",
kphx_2016="Phoenix, AZ (2016)",
ksan_2014="San Diego, CA (2014)",
ksan_2015="San Diego, CA (2015)",
ksan_2016="San Diego, CA (2016)",
ksan_2017="San Diego, CA (2017)",
ksan_2018="San Diego, CA (2018)",
ksan_2019="San Diego, CA (2019)",
ksat_2016="San Antonio, TX (2016)",
ksea_2016="Seattle, WA (2016)",
ksfo_2016="San Francisco, CA (2016)",
ksjc_2016="San Jose, CA (2016)",
kstl_2016="Saint Louis, MO (2016)",
ktpa_2016="Tampa Bay, FL (2016)",
ktvc_2016="Traverse City, MI (2016)"
)
# File names in 2016, based on cityNameMapper
names_2016 <- grep(names(cityNameMapper), pattern="[a-z]{3}_2016", value=TRUE)
The main data will be from the metar_postEDA files. They are integrated below, and cloud and ceiling heights are converted to factors:
# Main weather data
metarData <- readRDS("./RInputFiles/ProcessedMETAR/metar_postEDA_20200617.rds") %>%
bind_rows(readRDS("./RInputFiles/ProcessedMETAR/metar_postEDA_extra_20200627.rds")) %>%
mutate(orig_minHeight=minHeight,
orig_ceilingHeight=ceilingHeight,
minHeight=mapCloudHeight(minHeight),
ceilingHeight=mapCloudHeight(ceilingHeight)
)
glimpse(metarData)
## Observations: 437,417
## Variables: 43
## $ source <chr> "kdtw_2016", "kdtw_2016", "kdtw_2016", "kdtw_201...
## $ locale <chr> "Detroit, MI (2016)", "Detroit, MI (2016)", "Det...
## $ dtime <dttm> 2016-01-01 00:53:00, 2016-01-01 01:53:00, 2016-...
## $ origMETAR <chr> "KDTW 010053Z 23012KT 10SM OVC020 00/M05 A3021 R...
## $ year <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ monthint <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ month <fct> Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ WindDir <chr> "230", "230", "230", "240", "230", "220", "220",...
## $ WindSpeed <int> 12, 12, 11, 14, 16, 13, 14, 16, 13, 16, 17, 13, ...
## $ WindGust <dbl> NA, NA, NA, 23, 22, NA, 20, 20, NA, 22, NA, NA, ...
## $ predomDir <fct> SW, SW, SW, SW, SW, SW, SW, SW, SW, SW, SW, SW, ...
## $ Visibility <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 8, 5, 7, 8, 10, ...
## $ Altimeter <dbl> 30.21, 30.21, 30.19, 30.19, 30.18, 30.16, 30.14,...
## $ TempF <dbl> 32.00, 32.00, 32.00, 30.92, 30.92, 32.00, 30.92,...
## $ DewF <dbl> 23.00, 21.92, 21.02, 19.94, 19.94, 19.94, 19.94,...
## $ modSLP <dbl> 1023.6, 1023.5, 1023.0, 1023.0, 1022.7, 1022.0, ...
## $ cType1 <chr> "OVC", "OVC", "OVC", "OVC", "OVC", "OVC", "OVC",...
## $ cType2 <chr> "", "", "", "", "", "", "", "", "", "OVC", "OVC"...
## $ cType3 <chr> "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ cType4 <chr> "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ cType5 <chr> "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ cType6 <chr> "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ cLevel1 <dbl> 2000, 2000, 2200, 2500, 2700, 2500, 2500, 2500, ...
## $ cLevel2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 4500, 4000, ...
## $ cLevel3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ cLevel4 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ cLevel5 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ cLevel6 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ isRain <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ isSnow <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ isThunder <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ p1Inches <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 0, 0, 0, 0, 0, N...
## $ p36Inches <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 0, NA, NA, 0, NA...
## $ p24Inches <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ tempFHi <dbl> NA, NA, NA, NA, 36, NA, NA, NA, NA, NA, NA, NA, ...
## $ tempFLo <dbl> NA, NA, NA, NA, 31, NA, NA, NA, NA, NA, NA, NA, ...
## $ minHeight <fct> Low, Low, Low, Low, Low, Low, Low, Low, Low, Low...
## $ minType <fct> OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, BKN...
## $ ceilingHeight <fct> Low, Low, Low, Low, Low, Low, Low, Low, Low, Low...
## $ ceilingType <fct> OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, BKN...
## $ orig_minHeight <dbl> 2000, 2000, 2200, 2500, 2700, 2500, 2500, 2500, ...
## $ orig_ceilingHeight <dbl> 2000, 2000, 2200, 2500, 2700, 2500, 2500, 2500, ...
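The mapCloudHeight() function is defined in ‘WeatherModelingFunctions_v001.R’ and is not reproduced here. For orientation, the following is a hypothetical sketch of this kind of height-to-factor binning; the breakpoints follow the conventional low/mid/high cloud étages (~6,500 ft and ~20,000 ft) and are an assumption, not the actual implementation:

```r
# Hypothetical binning of cloud heights (feet) into a Low/Mid/High factor.
# Breakpoints are assumed; the real mapCloudHeight() may differ.
mapCloudHeight_sketch <- function(heightFt) {
cut(heightFt,
breaks=c(-Inf, 6500, 20000, Inf),
labels=c("Low", "Mid", "High"))
}
as.character(mapCloudHeight_sketch(c(2000, 10000, 30000)))
# -> "Low" "Mid" "High"
```

This is consistent with the glimpse above, where a 2,000 ft ceiling maps to “Low”.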
An initial exploration of the data can use hierarchical clustering based on several numerical features (temperature, dew point, sea-level pressure, wind speed), aggregated by month.
Example code includes:
# Create hierarchical clustering for 2016 data
# Distance is calculated only where month is the same
distData <- metarData %>%
filter(year==2016) %>%
select(locale, month, TempF, DewF, modSLP, WindSpeed) %>%
group_by(locale, month) %>%
summarize_if(is.numeric, mean, na.rm=TRUE) %>%
ungroup() %>%
mutate(rowname=paste0(locale, "_", month)) %>%
column_to_rownames() %>%
select(-locale, -month) %>%
as.matrix() %>%
scale() %>%
dist() %>%
as.matrix() %>%
as.data.frame() %>%
rownames_to_column(var="locale1") %>%
pivot_longer(-locale1, names_to="locale2", values_to="dist") %>%
mutate(city1=str_replace(locale1, "_\\w{3}", ""),
city2=str_replace(locale2, "_\\w{3}", ""),
month1=str_sub(locale1, -3),
month2=str_sub(locale2, -3)) %>%
filter(month1==month2) %>%
group_by(city1, city2) %>%
summarize(meandist=mean(dist), sddist=sd(dist)) %>%
ungroup() %>%
select(-sddist) %>%
pivot_wider(city1, names_from="city2", values_from="meandist") %>%
column_to_rownames(var="city1") %>%
as.matrix() %>%
as.dist(diag=FALSE)
distData %>%
hclust(method="complete") %>%
plot()
distData %>%
hclust(method="single") %>%
plot()
At a glance, there are several sensible findings from the clustering:
There are a handful of cities that do not seem to cluster with anything else:
Suppose that the only information available about two cities were their temperatures and dew points, with month also included. How well would a basic random forest, with mtry=3, classify the cities?
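The classification itself is handled by rfTwoLocales() from the sourced functions file, which is not shown in this document. The following is a hypothetical sketch of what such a helper might do; the argument handling, train/test split, and return value are assumptions, not the actual implementation:

```r
library(randomForest)

# Hypothetical sketch of a two-locale classifier (NOT the sourced function).
# With three predictors and mtry=3, every predictor is tried at each split.
rfTwoLocales_sketch <- function(data, loc1, loc2, vrbls,
ntree=25, mtry=3, testSize=0.3) {
sub <- data[data$source %in% c(loc1, loc2), c("source", vrbls)]
sub <- sub[complete.cases(sub), ]
sub$source <- factor(sub$source)
# Hold out testSize of the rows for evaluation
idx <- sample(nrow(sub), size=round((1 - testSize) * nrow(sub)))
fit <- randomForest(source ~ ., data=sub[idx, ], ntree=ntree, mtry=mtry)
preds <- predict(fit, newdata=sub[-idx, ])
list(model=fit,
accOverall=mean(preds == sub$source[-idx]))
}
```

The real function presumably also tracks per-locale accuracy, which helperAccuracyLocale() later extracts into a tibble.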
The function is then run for every pairwise combination of 2016 locales in cityNameMapper. A single random seed is set before the loop so the full process is reproducible:
# Create a container list to hold the output
list_2016_TempF_DewF_month <- vector("list", 0.5*length(names_2016)*(length(names_2016)-1))
set.seed(2006281443)
n <- 1
for (ctr in 1:(length(names_2016)-1)) {
for (ctr2 in (ctr+1):length(names_2016)) {
list_2016_TempF_DewF_month[[n]] <- rfTwoLocales(metarData,
loc1=names_2016[ctr],
loc2=names_2016[ctr2],
vrbls=c("TempF", "DewF", "month"),
ntree=25
)
n <- n + 1
if ((n %% 50) == 0) { cat("Through number:", n, "\n")}
}
}
## Through number: 50
## Through number: 100
## Through number: 150
## Through number: 200
## Through number: 250
## Through number: 300
## Through number: 350
## Through number: 400
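As a sanity check on the loop bounds: with the 30 locales in names_2016, the nested loop fits 0.5 × 30 × 29 = 435 pairwise models, consistent with the progress output ending above 400. The same enumeration can be produced with base R's combn():

```r
# Each unordered pair of locales appears exactly once in the loop above;
# combn() enumerates the same pairs, one per column.
n_locales <- 30                      # length(names_2016) in this document
pairs <- combn(seq_len(n_locales), 2)
ncol(pairs) == choose(n_locales, 2)  # TRUE: 435 pairwise models
```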
Accuracy can then be assessed:
# Create a tibble from the underlying accuracy data
acc_TempF_DewF_month <- map_dfr(list_2016_TempF_DewF_month, .f=helperAccuracyLocale)
# Assess the top 10 classification accuracies
acc_TempF_DewF_month %>%
arrange(-accOverall) %>%
head(10)
## # A tibble: 10 x 5
## locale1 locale2 accOverall accLocale1 accLocale2
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Denver, CO (2016) Miami, FL (2016) 0.997 0.997 0.997
## 2 Las Vegas, NV (2016) Miami, FL (2016) 0.990 0.991 0.989
## 3 Miami, FL (2016) Seattle, WA (2016) 0.985 0.985 0.985
## 4 Denver, CO (2016) Tampa Bay, FL (2016) 0.984 0.986 0.982
## 5 Miami, FL (2016) Traverse City, MI (201~ 0.980 0.981 0.980
## 6 Miami, FL (2016) Minneapolis, MN (2016) 0.978 0.978 0.978
## 7 Boston, MA (2016) Miami, FL (2016) 0.976 0.975 0.976
## 8 Denver, CO (2016) New Orleans, LA (2016) 0.975 0.975 0.975
## 9 Miami, FL (2016) Phoenix, AZ (2016) 0.974 0.973 0.975
## 10 Green Bay, WI (2016) Miami, FL (2016) 0.971 0.972 0.971
# Assess the bottom 10 classification accuracies
acc_TempF_DewF_month %>%
arrange(accOverall) %>%
head(10)
## # A tibble: 10 x 5
## locale1 locale2 accOverall accLocale1 accLocale2
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Newark, NJ (2016) Philadelphia, PA (20~ 0.580 0.591 0.570
## 2 Chicago, IL (2016) Milwaukee, WI (2016) 0.591 0.600 0.583
## 3 Chicago, IL (2016) Madison, WI (2016) 0.598 0.622 0.573
## 4 Detroit, MI (2016) Grand Rapids, MI (20~ 0.608 0.568 0.649
## 5 Chicago, IL (2016) Detroit, MI (2016) 0.610 0.633 0.587
## 6 Grand Rapids, MI (201~ Traverse City, MI (2~ 0.613 0.630 0.597
## 7 Madison, WI (2016) Milwaukee, WI (2016) 0.618 0.625 0.611
## 8 Green Bay, WI (2016) Madison, WI (2016) 0.619 0.586 0.652
## 9 Detroit, MI (2016) Madison, WI (2016) 0.619 0.639 0.598
## 10 Chicago, IL (2016) Grand Rapids, MI (20~ 0.621 0.647 0.595
allAccuracy_month <- select(acc_TempF_DewF_month,
locale=locale1,
other=locale2,
accOverall,
accLocale=accLocale1
) %>%
bind_rows(select(acc_TempF_DewF_month,
locale=locale2,
other=locale1,
accOverall,
accLocale=accLocale2
)
)
# Overall accuracy by location plot
allAccuracy_month %>%
group_by(locale) %>%
summarize_if(is.numeric, mean) %>%
ggplot(aes(x=fct_reorder(locale, accOverall), y=accOverall)) +
geom_point(size=2) +
geom_text(aes(y=accOverall+0.02, label=paste0(round(100*accOverall), "%"))) +
coord_flip() +
labs(x="",
y="Average Accuracy",
title="Average Accuracy Predicting Locale",
subtitle="Predictions made 1:1 to each other locale (average accuracy reported)",
caption="Temperature, Dew Point, Month as predictors\n(50% is baseline coinflip accuracy)"
) +
ylim(c(0.5, 1))
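One caution on the plot above: ylim() filters out any data beyond the limits, so the labels drawn at accOverall + 0.02 are silently dropped whenever accuracy exceeds 0.98. coord_cartesian() zooms the axis without discarding rows. A minimal illustration (assumes ggplot2 is already loaded):

```r
library(ggplot2)
# Toy point near the top of the accuracy range
d <- data.frame(locale="Denver, CO (2016)", accOverall=0.99)
p <- ggplot(d, aes(x=locale, y=accOverall)) +
geom_point(size=2) +
geom_text(aes(y=accOverall + 0.02, label="99%")) +
coord_cartesian(ylim=c(0.5, 1))  # zooms only; the label at 1.01 survives
```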
# Overall accuracy heatmap (FIX for better ordering of locales)
# allAccuracy_month %>%
# ggplot(aes(x=locale, y=other)) +
# geom_tile(aes(fill=accOverall)) +
# theme(axis.text.x=element_text(angle=90)) +
# scale_fill_continuous("Accuracy", high="darkblue", low="white") +
# labs(title="Accuracy Predicting Locale vs. Locale",
# caption="Temperature, Dew Point, and Month as predictors\n(50% is baseline coinflip accuracy)",
# x="",
# y=""
# )
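For the FIX noted above, one option is to order both axes by each locale's mean overall accuracy. A base-R sketch of deriving that ordering (toy values stand in for allAccuracy_month; the column names match its assumed shape):

```r
# Toy stand-in for allAccuracy_month (locale / accOverall columns assumed)
acc <- data.frame(locale=c("A", "B", "A", "C", "B", "C"),
accOverall=c(0.90, 0.60, 0.80, 0.70, 0.50, 0.90))
# Mean accuracy per locale, then locale names sorted by that mean
orderedLocales <- names(sort(tapply(acc$accOverall, acc$locale, mean)))
orderedLocales  # "B" "C" "A"
# In the commented plot, apply via:
#   mutate(locale=factor(locale, levels=orderedLocales),
#          other=factor(other, levels=orderedLocales))
```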
Accuracy is best for cities with significantly different climates. The model is especially successful at pulling apart the desert cities (Las Vegas, Phoenix) from everything else, and the humid Florida cities (Miami, Tampa Bay) from everything else.
Next, the simple model is run to classify locale across the full 2016 dataset:
# Run random forest for all 2016 locales
rf_all_2016_TDm <- rfMultiLocale(metarData,
vrbls=c("TempF", "DewF", "month"),
locs=names_2016,
ntree=50,
seed=2006281508
)
Summaries can then be created for the accuracy in predicting each locale:
evalPredictions(rf_all_2016_TDm, plotCaption = "Temp, Dew Point, Month")
## # A tibble: 898 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Atlanta, GA (2016) Atlanta, GA (2016) TRUE 484 0.182
## 2 Atlanta, GA (2016) Boston, MA (2016) FALSE 49 0.0185
## 3 Atlanta, GA (2016) Chicago, IL (2016) FALSE 39 0.0147
## 4 Atlanta, GA (2016) Dallas, TX (2016) FALSE 179 0.0675
## 5 Atlanta, GA (2016) Denver, CO (2016) FALSE 34 0.0128
## 6 Atlanta, GA (2016) Detroit, MI (2016) FALSE 42 0.0158
## 7 Atlanta, GA (2016) Grand Rapids, MI (2016) FALSE 48 0.0181
## 8 Atlanta, GA (2016) Green Bay, WI (2016) FALSE 42 0.0158
## 9 Atlanta, GA (2016) Houston, TX (2016) FALSE 96 0.0362
## 10 Atlanta, GA (2016) Indianapolis, IN (2016) FALSE 49 0.0185
## # ... with 888 more rows
Accuracy is meaningfully increased compared to the null accuracy, though no locale is classified correctly more than 50% of the time. This suggests that the goal will be to define some archetype climates (e.g., humid, cold, desert) and to predict those instead.
Clouds (minimum levels and ceiling heights) can potentially help further differentiate the cold weather cities on the downwind side of the Great Lakes, as well as help to further pull apart various marine and mid-continental climates.
An updated random forest model is then run:
# Run random forest for all 2016 locales
rf_all_2016_TDmc <- rfMultiLocale(metarData,
vrbls=c("TempF", "DewF", "month", "minHeight", "ceilingHeight"),
locs=names_2016,
ntree=50,
seed=2006281509
)
The evaluation process can again be run:
evalPredictions(rf_all_2016_TDmc, plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height")
## # A tibble: 897 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Atlanta, GA (2016) Atlanta, GA (2016) TRUE 643 0.240
## 2 Atlanta, GA (2016) Boston, MA (2016) FALSE 84 0.0313
## 3 Atlanta, GA (2016) Chicago, IL (2016) FALSE 61 0.0227
## 4 Atlanta, GA (2016) Dallas, TX (2016) FALSE 145 0.0541
## 5 Atlanta, GA (2016) Denver, CO (2016) FALSE 44 0.0164
## 6 Atlanta, GA (2016) Detroit, MI (2016) FALSE 51 0.0190
## 7 Atlanta, GA (2016) Grand Rapids, MI (2016) FALSE 49 0.0183
## 8 Atlanta, GA (2016) Green Bay, WI (2016) FALSE 26 0.00969
## 9 Atlanta, GA (2016) Houston, TX (2016) FALSE 82 0.0306
## 10 Atlanta, GA (2016) Indianapolis, IN (2016) FALSE 100 0.0373
## # ... with 887 more rows
# Get the comparison to previous accuracy
deltaAccuracy(prevObject=rf_all_2016_TDm, currObject=rf_all_2016_TDmc)
Accuracy increases meaningfully, though still well under 50% in most cases.
The model can also be built out to consider wind speed and wind direction. No attempt yet is made to control for over-fitting:
# Run random forest for all 2016 locales
rf_all_2016_TDmcw <- rfMultiLocale(metarData,
vrbls=c("TempF", "DewF",
"month",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir"
),
locs=names_2016,
ntree=50,
seed=2006281934
)
The evaluation process can again be run:
# Get the graphs of the prediction accuracy
evalPredictions(rf_all_2016_TDmcw,
plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height, Wind Speed/Direction"
)
## # A tibble: 889 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Atlanta, GA (2016) Atlanta, GA (2016) TRUE 1096 0.415
## 2 Atlanta, GA (2016) Boston, MA (2016) FALSE 34 0.0129
## 3 Atlanta, GA (2016) Chicago, IL (2016) FALSE 46 0.0174
## 4 Atlanta, GA (2016) Dallas, TX (2016) FALSE 135 0.0511
## 5 Atlanta, GA (2016) Denver, CO (2016) FALSE 28 0.0106
## 6 Atlanta, GA (2016) Detroit, MI (2016) FALSE 31 0.0117
## 7 Atlanta, GA (2016) Grand Rapids, MI (2016) FALSE 26 0.00984
## 8 Atlanta, GA (2016) Green Bay, WI (2016) FALSE 24 0.00908
## 9 Atlanta, GA (2016) Houston, TX (2016) FALSE 58 0.0219
## 10 Atlanta, GA (2016) Indianapolis, IN (2016) FALSE 59 0.0223
## # ... with 879 more rows
# Get the comparison to previous accuracy
deltaAccuracy(prevObject=rf_all_2016_TDmc, currObject=rf_all_2016_TDmcw)
Including wind significantly improves model accuracy for many locales. Even the cold weather cities are now being predicted with around 30%-40% accuracy.
The model can also be built out to consider sea-level pressure. No attempt yet is made to control for over-fitting, except that mtry is held at 4:
# Run random forest for all 2016 locales
rf_all_2016_TDmcwa <- rfMultiLocale(metarData,
vrbls=c("TempF", "DewF",
"month",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=names_2016,
ntree=50,
mtry=4,
seed=2006282103
)
The evaluation process can again be run:
# Get the graphs of the prediction accuracy
evalPredictions(rf_all_2016_TDmcwa,
plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height, Wind Speed/Direction, SLP"
)
## # A tibble: 880 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Atlanta, GA (2016) Atlanta, GA (2016) TRUE 1639 0.625
## 2 Atlanta, GA (2016) Boston, MA (2016) FALSE 16 0.00610
## 3 Atlanta, GA (2016) Chicago, IL (2016) FALSE 23 0.00877
## 4 Atlanta, GA (2016) Dallas, TX (2016) FALSE 75 0.0286
## 5 Atlanta, GA (2016) Denver, CO (2016) FALSE 8 0.00305
## 6 Atlanta, GA (2016) Detroit, MI (2016) FALSE 18 0.00686
## 7 Atlanta, GA (2016) Grand Rapids, MI (2016) FALSE 12 0.00457
## 8 Atlanta, GA (2016) Green Bay, WI (2016) FALSE 6 0.00229
## 9 Atlanta, GA (2016) Houston, TX (2016) FALSE 45 0.0172
## 10 Atlanta, GA (2016) Indianapolis, IN (2016) FALSE 51 0.0194
## # ... with 870 more rows
# Get the comparison to previous accuracy
deltaAccuracy(prevObject=rf_all_2016_TDmcw, currObject=rf_all_2016_TDmcwa)
Adding sea-level pressure again significantly improves prediction ability. The cold weather cities are in the roughly 40%-60% accuracy range, and the other cities are in the roughly 60%-80% accuracy range.
Based on the hierarchical clustering and results of the initial random forest models, a few archetype cities are defined in an attempt to pull out distinctions:
A data subset is created, with “hr” added for the Zulu hour of the observation (as integer rather than as factor):
archeCities <- c("kmia_2016", "klas_2016", "klax_2016", "kden_2016",
"ksea_2016", "kdfw_2016", "kmsp_2016", "katl_2016"
)
archeData <- metarData %>%
filter(source %in% archeCities) %>%
mutate(hr=lubridate::hour(dtime))
A random forest, with mtry=4, is then run using all variables from previous, as well as hour:
# Run random forest for all 2016 locales
rf_arche_2016_TDmcwa <- rfMultiLocale(archeData,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL, # defaults to running for all locales
ntree=100,
mtry=4,
seed=2006291353
)
##
## Running for locations:
## [1] "katl_2016" "kden_2016" "kdfw_2016" "klas_2016" "klax_2016" "kmia_2016"
## [7] "kmsp_2016" "ksea_2016"
The evaluation process can again be run:
# Get the graphs of the prediction accuracy
evalPredictions(rf_arche_2016_TDmcwa,
plotCaption = "Temp, Dew Point, Month/Hour, Cloud/Ceiling Height, Wind Speed/Direction, SLP"
)
## # A tibble: 62 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Atlanta, GA (2016) Atlanta, GA (2016) TRUE 2204 0.835
## 2 Atlanta, GA (2016) Dallas, TX (2016) FALSE 139 0.0527
## 3 Atlanta, GA (2016) Denver, CO (2016) FALSE 39 0.0148
## 4 Atlanta, GA (2016) Las Vegas, NV (2016) FALSE 33 0.0125
## 5 Atlanta, GA (2016) Los Angeles, CA (2016) FALSE 86 0.0326
## 6 Atlanta, GA (2016) Miami, FL (2016) FALSE 50 0.0189
## 7 Atlanta, GA (2016) Minneapolis, MN (2016) FALSE 45 0.0170
## 8 Atlanta, GA (2016) Seattle, WA (2016) FALSE 44 0.0167
## 9 Dallas, TX (2016) Atlanta, GA (2016) FALSE 203 0.0789
## 10 Dallas, TX (2016) Dallas, TX (2016) TRUE 2105 0.818
## # ... with 52 more rows
The model is reasonably successful in pulling apart the archetypes. Denver and Dallas seem potentially less distinct, though all archetype cities are predicted at 80%+ accuracy.
The remaining cities can be classified against the archetype data, to explore any patterns that emerge:
localeByArchetype(rf_arche_2016_TDmcwa,
fullData=metarData,
archeCities=archeCities,
sortDescMatch=TRUE
)
There appear to be several issues:
The first issue is addressed by collapsing Atlanta, Dallas, and Miami and instead using Houston as the archetype; and replacing Minneapolis with Chicago:
archeCities_v02 <- c("kiah_2016", "klas_2016", "klax_2016",
"kden_2016", "ksea_2016", "kord_2016"
)
archeData_v02 <- metarData %>%
filter(source %in% archeCities_v02) %>%
mutate(hr=lubridate::hour(dtime))
A random forest, with mtry=4, is then run using all variables from previous, as well as hour:
# Run random forest for all 2016 locales in archeData_v02
rf_arche_2016_TDmcwa_v02 <- rfMultiLocale(archeData_v02,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL, # defaults to running for all locales
ntree=100,
mtry=4,
seed=2006291448
)
##
## Running for locations:
## [1] "kden_2016" "kiah_2016" "klas_2016" "klax_2016" "kord_2016" "ksea_2016"
The evaluation process can again be run:
# Get the graphs of the prediction accuracy
evalPredictions(rf_arche_2016_TDmcwa_v02,
plotCaption = "Temp, Dew Point, Month/Hour, Cloud/Ceiling Height, Wind Speed/Direction, SLP"
)
## # A tibble: 36 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Chicago, IL (2016) Chicago, IL (2016) TRUE 2321 0.872
## 2 Chicago, IL (2016) Denver, CO (2016) FALSE 99 0.0372
## 3 Chicago, IL (2016) Houston, TX (2016) FALSE 54 0.0203
## 4 Chicago, IL (2016) Las Vegas, NV (2016) FALSE 13 0.00489
## 5 Chicago, IL (2016) Los Angeles, CA (2016) FALSE 70 0.0263
## 6 Chicago, IL (2016) Seattle, WA (2016) FALSE 104 0.0391
## 7 Denver, CO (2016) Chicago, IL (2016) FALSE 119 0.0453
## 8 Denver, CO (2016) Denver, CO (2016) TRUE 2282 0.869
## 9 Denver, CO (2016) Houston, TX (2016) FALSE 3 0.00114
## 10 Denver, CO (2016) Las Vegas, NV (2016) FALSE 126 0.0480
## # ... with 26 more rows
The model is reasonably successful in pulling apart the archetypes. Houston appears the most distinct (as had Miami previously), and there is less overlap going to/from Denver or Seattle.
The remaining cities can be classified against the archetype data, to explore any patterns that emerge:
localeByArchetype(rf_arche_2016_TDmcwa_v02,
fullData=metarData,
archeCities=archeCities_v02,
sortDescMatch = TRUE
)
At a glance, the archetypes are more sensible. There is a bit of Denver and a bit of Seattle in most of the cold-weather cities, and a bit of Houston in some of the warmer cold-weather cities. This suggests that there is either more than one type of cold-weather city, or that cold-weather cities tend to resemble other archetypes at certain times.
The model is run again, dropping Denver and Seattle, and replacing Los Angeles with San Diego:
archeCities_v03 <- c("kiah_2016", "klas_2016", "ksan_2016", "kord_2016")
archeData_v03 <- metarData %>%
filter(source %in% archeCities_v03) %>%
mutate(hr=lubridate::hour(dtime))
A random forest, with mtry=4, is then run using all variables from previous, as well as hour:
# Run random forest for all 2016 locales in archeData_v03
rf_arche_2016_TDmcwa_v03 <- rfMultiLocale(archeData_v03,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL, # defaults to running for all locales
ntree=100,
mtry=4,
seed=2006291459
)
##
## Running for locations:
## [1] "kiah_2016" "klas_2016" "kord_2016" "ksan_2016"
The evaluation process can again be run:
# Get the graphs of the prediction accuracy
evalPredictions(rf_arche_2016_TDmcwa_v03,
plotCaption = "Temp, Dew Point, Month/Hour, Cloud/Ceiling Height, Wind Speed/Direction, SLP"
)
## Warning: Removed 1 rows containing missing values (geom_text).
## # A tibble: 16 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Chicago, IL (2016) Chicago, IL (2016) TRUE 2511 0.946
## 2 Chicago, IL (2016) Houston, TX (2016) FALSE 69 0.0260
## 3 Chicago, IL (2016) Las Vegas, NV (2016) FALSE 26 0.00979
## 4 Chicago, IL (2016) San Diego, CA (2016) FALSE 49 0.0185
## 5 Houston, TX (2016) Chicago, IL (2016) FALSE 44 0.0170
## 6 Houston, TX (2016) Houston, TX (2016) TRUE 2473 0.954
## 7 Houston, TX (2016) Las Vegas, NV (2016) FALSE 19 0.00733
## 8 Houston, TX (2016) San Diego, CA (2016) FALSE 57 0.0220
## 9 Las Vegas, NV (2016) Chicago, IL (2016) FALSE 29 0.0113
## 10 Las Vegas, NV (2016) Houston, TX (2016) FALSE 13 0.00508
## 11 Las Vegas, NV (2016) Las Vegas, NV (2016) TRUE 2482 0.971
## 12 Las Vegas, NV (2016) San Diego, CA (2016) FALSE 33 0.0129
## 13 San Diego, CA (2016) Chicago, IL (2016) FALSE 35 0.0130
## 14 San Diego, CA (2016) Houston, TX (2016) FALSE 29 0.0108
## 15 San Diego, CA (2016) Las Vegas, NV (2016) FALSE 63 0.0235
## 16 San Diego, CA (2016) San Diego, CA (2016) TRUE 2559 0.953
These archetypes are distinct, with accuracies all in the 95% range. The remaining cities can be classified against the archetype data, to explore any patterns that emerge:
localeByArchetype(rf_arche_2016_TDmcwa_v03,
fullData=metarData,
archeCities=archeCities_v03,
sortDescMatch=TRUE
)
Quite a few cities show as blends of the archetype cities, which is likely correct if archetypes are considered to be Cold, Humid, Marine, Desert. An open question is whether to expand that list with another 1-2 archetypes that attempt to better capture Atlanta, Dallas, Denver, St Louis, and DC.
The Marine archetype may need to be converted to a Pacific archetype to better identify the California entities.
A decision about Seattle is needed, as it is a little like everything and everything is a little like it.
To address the issue of the model learning specific cities rather than archetypes, user-defined clusters are created, and data from these clusters are used in modeling. A variable is also created for “hr”, the Zulu hour of the observation:
locale2016Mapper <- c('Cold-MI', 'Newark', 'Cold', 'Cold-MI', 'Humid',
'Cold', 'Desert', 'Lincoln', 'Cold', 'Cold',
'Cold', 'Humid', 'Cold', 'Marine', 'Cold-MI')
names(locale2016Mapper) <- c('Detroit, MI (2016)', 'Newark, NJ (2016)',
'Green Bay, WI (2016)', 'Grand Rapids, MI (2016)',
'Houston, TX (2016)', 'Indianapolis, IN (2016)',
'Las Vegas, NV (2016)', 'Lincoln, NE (2016)',
'Milwaukee, WI (2016)', 'Madison, WI (2016)',
'Minneapolis, MN (2016)', 'New Orleans, LA (2016)',
'Chicago, IL (2016)', 'San Diego, CA (2016)',
'Traverse City, MI (2016)')
tibble::tibble(locale=names(locale2016Mapper), mapping=locale2016Mapper) %>%
arrange(mapping, locale)
# modData holds the working copy of the weather data (created earlier)
modData <- modData %>%
mutate(hr=lubridate::hour(dtime),
locType=locale2016Mapper[locale]
)
modData %>%
count(locType, year) %>%
pivot_wider(locType, names_from="year", values_from="n")
The data are then filtered so that there are an equal number of observations from each locale type. The model can later be evaluated against the remaining observations:
# Set a seed for reproducibility
set.seed(2006211352)
# Find the smallest locale type
nSmall <- modData %>%
filter(!is.na(locType)) %>%
count(locType) %>%
pull(n) %>%
min()
# Create the relevant data subset
subData <- modData %>%
filter(!is.na(locType)) %>%
group_by(locType) %>%
sample_n(size=nSmall, replace=FALSE) %>%
ungroup()
# Summarize the data subset
subData %>%
count(locale) %>%
arrange(-n)
The previous model can then be run on the data subset:
# Run random forest for all 2016 locale types
rf_types_2016_TDmcw <- rfMultiLocale(subData,
vrbls=c("TempF", "DewF",
"month",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir"
),
locs=NULL,
locVar="locType",
pred="locType",
ntree=50,
seed=2006211401
)
The evaluation process can again be run:
evalPredictions(rf_types_2016_TDmcw,
plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height, Wind Speed/Direction",
keyVar="locType"
)
Predictive ability is strongest for the well-differentiated types, and weakest for the more poorly differentiated types. The model is about 50% accurate in classifying cold climates; 25% of the time they are classified as the other type of cold climate, and 25% of the time they are classified as something else (typically Lincoln or Newark).
The modeling is run again using a much smaller training dataset (testSize set to 0.9) to see the implication of a much smaller data volume:
# Run random forest for all 2016 locale types
rf_types_2016_small_TDmcw <- rfMultiLocale(subData,
vrbls=c("TempF", "DewF",
"month",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir"
),
locs=NULL,
locVar="locType",
pred="locType",
ntree=50,
seed=2006211419,
testSize=0.9
)
The evaluation process can again be run:
evalPredictions(rf_types_2016_small_TDmcw,
plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height, Wind Speed/Direction",
keyVar="locType"
)
As expected, predictions are less accurate with a smaller training dataset. Overall accuracy falls from ~70% to ~60% with the much smaller training data volume. This suggests the model is continuing to learn from the data, and may benefit from an expanded sample.
Next, the model is adapted to include the hour (as an integer, which may need to be rethought, though this will generally capture whether it is daytime or nighttime even without factoring) and the sea-level pressure variable:
# Run random forest for all 2016 locale types - add hour and modSLP, limit mtry to 4
rf_types_2016_TDmcwha <- rfMultiLocale(subData,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL,
locVar="locType",
pred="locType",
ntree=50,
seed=2006211423,
mtry=4
)
The evaluation process can again be run:
evalPredictions(rf_types_2016_TDmcwha,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, SLP",
keyVar="locType"
)
Accuracy increases to 75%. In particular, the model is better able to distinguish Lincoln and Newark from the remaining locations, which has a follow-on effect of improving cold weather classifications.
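The hour-as-integer concern above could eventually be addressed by collapsing the hour into a day/night factor. A minimal sketch under that assumption (toy values; in the modeling data the hour lives in the hr column):

```r
# Sketch: collapsing an integer hour (0-23) into a day/night factor,
# one possible refinement of feeding 'hr' in as an integer (toy values)
hr <- c(2, 7, 13, 19, 23)
dayPart <- cut(hr, breaks=c(-1, 5, 17, 23), labels=c("night", "day", "night2"))
levels(dayPart) <- c("night", "day", "night")   # merge the two night bins
as.character(dayPart)
```

Since cut() uses right-closed intervals, the breaks above place hours 6-17 in the day bin and the rest in the night bin.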
Exploration then turns to distinguishing two locales from each other, to see the implications of parameters such as data volume (controlled through testSize) and the number of trees.
Comparisons will be run as follows:
ntrees <- c(10, 25, 50, 100, 250, 500)
testSizes <- c(0.9, 0.7, 0.3)
desertHumid <- vector("list", length(ntrees) * length(testSizes))
n <- 1
for (ntree in ntrees) {
for (testSize in testSizes) {
desertHumid[[n]] <- rfTwoLocales(subData,
loc1="Desert",
loc2="Humid",
locVar="locType",
vrbls=c("TempF", "DewF", "month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir", "modSLP"
),
pred="locType",
seed=2006211432,
ntree=ntree,
mtry=4,
testSize=testSize
)
desertHumid[[n]]$ntree <- ntree
desertHumid[[n]]$testSize <- testSize
n <- n+1
}
}
An assessment of accuracy can then be performed:
df <- sapply(desertHumid,
FUN=function(x) { c(x[["errorRate"]]["OOB"], ntree=x$ntree, testSize=x$testSize) }
) %>%
t() %>%
tibble::as_tibble()
df %>%
mutate(accuracy=1-OOB, trainSize=1-testSize) %>%
ggplot(aes(x=ntree, y=trainSize)) +
geom_text(aes(label=paste0(round(100*accuracy), "%"))) +
scale_x_log10()
As expected, accuracy is very high for distinguishing Desert and Humid climates, even with small data volumes (training size 10% of the data) and a small number of trees.
The approach is then run for the cold weather cities:
ntrees <- c(10, 25, 50, 100, 250, 500)
testSizes <- c(0.9, 0.7, 0.3)
coldColdMI <- vector("list", length(ntrees) * length(testSizes))
n <- 1
for (ntree in ntrees) {
for (testSize in testSizes) {
coldColdMI[[n]] <- rfTwoLocales(subData,
loc1="Cold",
loc2="Cold-MI",
locVar="locType",
vrbls=c("TempF", "DewF", "month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir", "modSLP"
),
pred="locType",
seed=2006211501,
ntree=ntree,
mtry=4,
testSize=testSize
)
coldColdMI[[n]]$ntree <- ntree
coldColdMI[[n]]$testSize <- testSize
n <- n+1
}
}
An assessment of accuracy can then be performed:
dfColdColdMI <- sapply(coldColdMI,
FUN=function(x) { c(x[["errorRate"]]["OOB"], ntree=x$ntree, testSize=x$testSize) }
) %>%
t() %>%
tibble::as_tibble()
dfColdColdMI %>%
mutate(accuracy=1-OOB, trainSize=1-testSize) %>%
ggplot(aes(x=ntree, y=trainSize)) +
geom_text(aes(label=paste0(round(100*accuracy), "%"))) +
scale_x_log10()
Greater data volumes and larger forests are more helpful for distinguishing the cold weather city types from each other. The smallest forest has 56% accuracy (against a 50% null) while the largest has 68% accuracy. This suggests some potential upside to continuing the cold weather modeling process with more cities added for greater data volumes.
Data from years other than 2016 can be fed into the model, with predictions made to see whether the model is learning general differences in climate or specific features of 2016.
Data that were not included in the modeling can be predicted, with the results assessed:
Example code includes:
nonUsedData <- modData %>%
anti_join(select(subData, source, dtime), by=c("source", "dtime"))
nonUsedData %>%
mutate(localeName=str_replace(locale, pattern=" .{1}\\d{4}.*", replacement="")) %>%
count(localeName, year) %>%
pivot_wider(localeName, names_from="year", values_from="n")
The function can then be applied to the 2016 data:
# Modify the file for mapping locale to type
localeMapper <- locale2016Mapper
names(localeMapper) <- str_replace(names(localeMapper), pattern=" .\\d{4}.", replacement="")
# Predictions on 2016 data
helperPredictPlot(rf_types_2016_TDmcwha$rfModel,
df=filter(nonUsedData, year==2016),
predOrder=c("Cold", "Cold-MI", "Lincoln", "Newark", "Marine", "Humid", "Desert"),
locMapper=localeMapper
)
The predictions are in line with the test dataset conclusions, as would be expected given that the “unused” 2016 data are randomly sampled and should be no different from the “test” 2016 dataset.
The function can then be applied to the 2015 and 2017 data:
# Predictions on 2015/2017 data
helperPredictPlot(rf_types_2016_TDmcwha$rfModel,
df=filter(nonUsedData, year!=2016),
predOrder=c("Cold", "Cold-MI", "Lincoln", "Newark", "Marine", "Humid", "Desert"),
locMapper=localeMapper
)
Model performance on 2015 and 2017 data is not as strong, with roughly a 10%-20% loss of accuracy. Predictions are still much better than null accuracy, and the model (mostly) continues to separate the three well-differentiated types (Desert, Humid, Marine).
However, the model struggles to classify Chicago, placing it roughly equally as Cold (correct), Cold-MI, Lincoln, and Newark. This suggests the cold-weather classifications may rely on specific anomalies of 2016 (e.g., where the cold snaps hit). More data may help the model generalize better, specifically by separating the recurring, generalizable climate features of Chicago from weather anomalies that happened to hit Chicago in 2016.
Suppose that models are run on all 2015-2017 data for Chicago, Las Vegas, New Orleans, and San Diego:
# Create the subset for Chicago, Las Vegas, New Orleans, San Diego (should have 2015, 2016, 2017)
sub_2015_2017_data <- modData %>%
filter(str_sub(source, 1, 4) %in% c("kord", "klas", "kmsy", "ksan")) %>%
mutate(city=str_replace(locale, pattern=" .\\d{4}.", replacement=""))
# Check that proper locales are included
sub_2015_2017_data %>%
count(city, locale)
# Run random forest for 2015-2017 data
rf_types_2015_2017_TDmcwha <- rfMultiLocale(sub_2015_2017_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL,
locVar="city",
pred="city",
ntree=25,
seed=2006221334,
mtry=4
)
evalPredictions(rf_types_2015_2017_TDmcwha,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, SLP",
keyVar="city"
)
Even with a very small forest (25 trees), the model almost always separates Las Vegas, Chicago, San Diego, and New Orleans. While the climates of these cities are very different, it is striking that the model has so few misclassifications.
How do other cities map against these classifications?
# Predictions on 2015/2017 data
helperPredictPlot(rf_types_2015_2017_TDmcwha$rfModel,
df=filter(modData, !(str_sub(source, 1, 4) %in% c("kord", "klas", "kmsy", "ksan"))),
predOrder=c("Chicago, IL", "San Diego, CA", "New Orleans, LA", "Las Vegas, NV")
)
The prediction process is re-run excluding the altimeter (modSLP) data:
# Run random forest for 2015-2017 data (exclude modSLP)
rf_types_2015_2017_TDmcwh <- rfMultiLocale(sub_2015_2017_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir"
),
locs=NULL,
locVar="city",
pred="city",
ntree=25,
seed=2006221334,
mtry=4
)
evalPredictions(rf_types_2015_2017_TDmcwh,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind",
keyVar="city"
)
Performance drops, since altimeter readings differ meaningfully between the high-altitude (Las Vegas) and sea-level (New Orleans and San Diego) locations.
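The altitude effect can be sanity-checked with the standard-atmosphere barometric formula (a back-of-the-envelope sketch, not part of the modeling pipeline; 665 m is approximately the elevation of the Las Vegas airport):

```r
# Standard-atmosphere pressure at elevation h (meters); constants are the
# ICAO standard-atmosphere values
p0 <- 1013.25                            # sea-level standard pressure, hPa
h  <- 665                                # approx. Las Vegas airport elevation, m
p  <- p0 * (1 - 2.25577e-5 * h)^5.25588  # station-level pressure, hPa
p0 - p                                   # deficit of roughly 75-80 hPa
```

A pressure signal of that size dwarfs typical day-to-day synoptic variation, so a pressure-related variable separates the high-altitude site almost by itself.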
Variable importances are plotted:
helperPlotVarImp(rf_types_2015_2017_TDmcwha$rfModel)
helperPlotVarImp(rf_types_2015_2017_TDmcwh$rfModel, titleAdd=" (Excludes SLP)")
Dew point and temperature by month continue to be strong factors for separating the four cities in this analysis. SLP, minimum cloud height, and prevailing wind direction are also meaningful.
The assessment can be run for the two main 2015-2017 models:
# Run for the full model including SLP
probs_2015_2017_TDmcwha <-
assessPredictionCertainty(rf_types_2015_2017_TDmcwha,
keyVar="city",
plotCaption="Temp, Dew Point, Month/Hour, Clouds, Wind, SLP",
showAcc=TRUE
)
# Run for the model excluding SLP
probs_2015_2017_TDmcwh <-
assessPredictionCertainty(rf_types_2015_2017_TDmcwh,
keyVar="city",
plotCaption="Temp, Dew Point, Month/Hour, Clouds, Wind",
showAcc=TRUE
)
Predictions with 80%+ of the votes are made ~80% of the time, and these predictions are ~99% accurate. Predictions with <80% of the votes are made ~20% of the time, and these predictions are ~75% accurate. The percentage of votes received appears to be a reasonable proxy for the confidence of the prediction.
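The vote-share logic can be sketched with a toy vote matrix (illustrative numbers; in the actual workflow the shares come from the fitted rfModel's votes):

```r
# Toy per-observation vote shares across three classes (rows sum to 1)
votes <- matrix(c(0.90, 0.05, 0.05,
                  0.55, 0.30, 0.15),
                nrow=2, byrow=TRUE,
                dimnames=list(NULL, c("Chicago", "Las Vegas", "New Orleans")))
topShare <- apply(votes, 1, max)  # fraction of trees backing the winning class
highConf <- topShare >= 0.8       # the ~80% threshold discussed above
```

Here the first observation clears the 80% threshold and the second does not, matching the high-confidence/low-confidence split used in the assessment.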
A similar process can be run for assessing the classification of the other cities against the 2015-2017 data for Chicago, Las Vegas, New Orleans, and San Diego:
useData <- modData %>%
filter(!(str_sub(source, 1, 4) %in% c("kord", "klas", "kmsy", "ksan")))
# Run for the model excluding SLP
probs_allcities_2015_2017_TDmcwh <-
assessPredictionCertainty(rf_types_2015_2017_TDmcwh,
testData=useData,
keyVar="locale",
plotCaption="Temp, Dew Point, Month/Hour, Clouds, Wind",
showHists=TRUE
)
Several patterns stand out in the predictions.
Next, a forward variable-selection routine adds the next-best variable at each step. The actual run is cached since it takes a meaningful amount of time:
# Define the variables to be considered
possVars <- c("TempF", "DewF", "month", "hr", "minHeight",
"ceilingHeight", "WindSpeed", "predomDir", "modSLP"
)
# Create a container for storing the best random forest and variable added from each run
rfContainer <- vector("list", length=length(possVars))
rfVarAdds <- vector("character", length=length(possVars))
# Run the functions using a for loop over the length of possVars
for (ctr in seq_along(possVars)) {
# Pull in the results of the previous run
if (ctr==1) {
prevVars <- c()
} else {
prevVars <- rfVarAdds[1:(ctr-1)]
}
# Run each of them through the combinations
tmpList <- helperRFCombinations(possVars, df=sub_2015_2017_data, prevVars=prevVars)
# Assess the performance
tmpAccuracy <- helperVariableAccuracy(tmpList, possVars=possVars, prevVars=prevVars)
# Prepare to repeat the process
bestRow <- tmpAccuracy %>%
filter(locale=="OOB") %>%
filter(accuracy==max(accuracy))
print(bestRow)  # explicit print() needed for display inside a for loop
# Update the rfContainer and rfVarAdds elements
rfContainer[[ctr]] <- tmpList[[as.integer(pull(bestRow, vrblNum))]]
rfVarAdds[ctr] <- pull(bestRow, vrblName)
cat("\nVariable Added:", rfVarAdds[ctr], "\n")
}
The evolution of accuracy can then be plotted:
# Pull the accuracy data from the variables selected
tblAccuracy <- map_dfr(rfContainer, .f=helperExtractAccuracy, .id="vrblNum") %>%
mutate(vrblName=rfVarAdds[as.integer(vrblNum)],
locale=factor(locale, levels=c("OOB", unique(locale)[unique(locale) != "OOB"])),
plotLabel=ifelse(as.integer(vrblNum)==1, vrblName, paste0("add ", vrblName))
)
# Plot the gains in accuracy, faceted by 'locale'
ggplot(tblAccuracy, aes(x=fct_reorder(plotLabel, -as.integer(vrblNum)))) +
geom_text(aes(y=accuracy, label=paste0(round(100*accuracy), "%"))) +
facet_wrap(~locale, nrow=1) +
ylim(c(0, 1)) +
geom_hline(yintercept=0.25, lty=2) +
labs(x="",
y="",
title="Evolution of accuracy as next-best variables are added",
caption="Null accuracy is 25%"
) +
coord_flip()
As seen previously, dew point, temperature, and month are significant differentiating variables, driving roughly 75% accuracy in classifying the archetypes.
Adding altimeter (SLP) bumps the accuracy by another ~10%, while adding minimum cloud height and prevailing wind direction bump the accuracy by another ~5%.
This shows some of the power of the random forest algorithm as it is given additional variables to explore.
Further, the OOB estimates provided by the modeling are pessimistic relative to performance on the hold-out sample (94%-96% accuracy versus the estimated 89%-93%). This is driven partly by rfTwoLocales() pulling the average error rate rather than the final error rate, an oversight that carries throughout the analysis. The next version of this modeling should fix the approach to calculating errors.
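The mean-versus-final distinction is easy to illustrate with a toy OOB error trace (illustrative values; randomForest stores one row per tree in $err.rate):

```r
# Toy OOB error trace mimicking rfModel$err.rate[, "OOB"], one entry per tree
oob <- c(0.30, 0.22, 0.15, 0.12, 0.11)  # error typically falls as trees are added
meanOOB  <- mean(oob)          # the average rate currently being reported (0.18)
finalOOB <- oob[length(oob)]   # the converged full-forest estimate (0.11)
```

Because the early, high-error trees drag the average up, the mean overstates the error of the completed forest; the final row is the usual converged estimate.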
Evolution of error can be plotted to see the impact:
oobError <- c()
for (ctr in 1:9) {
oobError <- c(oobError, rfContainer[[ctr]]$rfModel$err.rate[, "OOB"])
}
tibble::tibble(nVars=rep(1:9, each=25),
ntrees=rep(1:25, times=9),
accuracy=1-oobError
) %>%
ggplot(aes(x=ntrees, y=accuracy)) +
geom_line(aes(group=nVars, color=factor(nVars))) +
labs(x="Number of Trees",
y="OOB Accuracy",
title="Accuracy improves with more trees and more variables"
) +
scale_color_discrete("# Variables")
It will be interesting to see how the four-archetype models perform on other large US cities. A separate routine is written to source the key download and EDA functions and to run them on these new data.